Introduction

Healthcare access is a fundamental concern for many, as access to healthcare impacts a person’s ability to live a healthy life. For example, it can increase access to medications or suggestions from medical professionals on treatment (Baiker et al., 2013). Many factors can also influence individuals’ access to healthcare, such as race, age, and insurance status, whether private or public. However, determining if any relationships exist between demographic factors and insurance status may provide additional insights into a mechanism for various social determinants of health to impact healthcare access. Moreover, understanding these relationships can help convince policymakers to consider population-targeted health insurance expansion efforts to improve health outcomes for individuals.

The 2020 Social Determinants of Health database (SDOH) is the dataset we chose for this investigation. SDOH was assembled by the Agency for Healthcare Research and Quality (AHRCQ, 2023), which falls under the jurisdiction of the US Department of Health and Human Services. This dataset was particularly useful for our research as it was developed to make it easier to find a range of well-documented, readily linkable social determinants of health variables across domains without accessing multiple source files, facilitating SDOH research and analysis. We chose 2020 because it was the most recent year in the database. The data is sourced from nationwide surveys, such as the US Census and the PLACES dataset (collected by the Center for Disease Control).

SDOH was created to provide easily accessible data that can be used for research on social determinants of health, visualize patterns about health outcomes, and contribute to them. Furthermore, this dataset has different types of variables that fall into five main categories, which allows us to examine the relationship of variables across categories. The five main categories are:

1. Social Context = demographics, socioeconomic status

2. Economic Context = employment, income

3. Education = level of education, literacy

4. Infrastructure = food accessibility, housing

5. Healthcare Context = insurance, distance to healthcare facilities

We centered our analysis of this dataset around the healthcare context by investigating how private healthcare insurance coverage varies with two other determinant categories of health: social context (race and English proficiency) and education (level of education). We are motivated to analyze how private insurance is affected by social determinants of health to understand what factors are associated with one’s access to healthcare or lack thereof. In summary, the purpose of this work and our central research question is the following:

What is the relationship between social and educational indicators and private health insurance coverage in the United States?

Analyzing SDOH can provide insights into potential associations between various social determinants of health and health insurance coverage, a specific mechanism that affects healthcare access. While this data set cannot provide any insights into causal relationships between these variables or how any findings might relate to health outcomes beyond access, the initial associations these relationships can determine can be a building block for this future work. This is due to the ability of findings from investigative analysis on SDOH to highlight what populations or social determinant factors to consider in efforts to improve human health.

Setup for Analysis

To set up R Studio to complete our analysis, the tidyverse and plotly packages were loaded into the working environment. These packages contain numerous functions in R Studio that allow for deepened analysis, including creating subsets of variables and plots and calculating various summary statistics for variables simultaneously.

Following the loading of our selected packages, we loaded SDOH into R Studio, the integrated development environment used for data analysis in this project, using the “read.csv” function, which imports all the dataset values, variables, rows, and columns into R Studio for analysis. The structure of the dataset had countless percetanges of populations by demographics for counties across the United States.

#Load in packages for analysis 
library(tidyverse)
library(plotly) 

#Obtain data set
sdoh <- read_csv("data/sdohc.csv")

Data Cleaning of Key Initial Variables

Ensuring data is cleaned is a practical measure to minimize errors in analyzing or interpreting research findings. The “drop_na” function was used to clean the SDOH dataset, our key variable of interest, the percentage of a population with private health insurance. This function removes all the NA values from the variable, which are all the “not applicable” or blank submissions, to ensure that summary statistics and plots can be analyzed while minimizing bias from including blank submissions alongside filled ones.

#Drop rows with NA values in variables of interest 
sdoh <- sdoh %>%
  drop_na(ACS_PCT_PRIVATE_ANY) %>% 
  drop_na(COUNTY)

After this point, no further data cleaning or reformatting was completed. Instead, the next step was to examine the private health insurance ownership variable to assess whether further data cleaning was needed. It is noted that after further investigation, cleaning was not conducted, as only correlational plots were viewed in this analysis.

Examining Structure of Private Insurance Variable

The first step in examining the private health insurance variable was to calculate various summary statistics using the “summary()” function. In order to understand the general patterns of data and have an overview of a variable’s key features and distributions, we need to look at statistical summaries of key variables in our dataset.

#Calculate summary statistics for % pop. w/private health insurance variable
summary(sdoh$ACS_PCT_PRIVATE_ANY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.68   50.04   57.87   57.23   65.45   98.38

The summary output shows the median (57.87) and mean (57.23) of the percentage of the county population with private health insurance very close to one another. This indicates a normal distribution of the variable, which can also be visualized in the histogram plot before using the “ggplot” and “geom_histogram” functions to create the plot.

#Visualize distribution of insurance coverage in US by county
ggplot(sdoh, aes(x=ACS_PCT_PRIVATE_ANY)) + 
  geom_histogram(color = "grey", fill = "light blue")+
  xlab("% of county population with private health insurance")+
  ylab("number of counties")+
  ggtitle("Distribution of private health insurance coverage by county in the US")+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Beyond statistical summaries of the variable overall, it can also be important to highlight where private health insurance coverage is greatest and lowest in the United States to help identify potential locations of interest in future analysis. The counties with the highest and lowest private health insurance coverage were also determined using various functions.

First, a new vector was created, which only includes the two variables of interest: the county and the percentage of the county population with private health insurance. Then, a new, smaller data frame was created with only the variables in the new vector.

The “arrange(desc)” function was responsible for organizing the insurance coverage in descending order, so when the “head()” function is used, only the top five counties with high private insurance coverage are printed.

#Create vector with counties and insurance
keep <- c("COUNTY", "ACS_PCT_PRIVATE_ANY")

#Create new data frame with selected variables
insurance_df <- sdoh[keep]

#Sort data in descending order by insurance coverage
insurance_df <- insurance_df %>%
  arrange(desc(ACS_PCT_PRIVATE_ANY))
  
#View top 5 counties with highest insurance coverage
head(insurance_df)
## # A tibble: 6 × 2
##   COUNTY            ACS_PCT_PRIVATE_ANY
##   <chr>                           <dbl>
## 1 Kalawao County                   98.4
## 2 Falls Church city                87.8
## 3 Daggett County                   85.7
## 4 Los Alamos County                85.3
## 5 Burke County                     83.4
## 6 Carver County                    83.4

In the same way, using just the “arrange()” function organizes the insurance coverage in ascending order, so when the “head()” function is used, only the lowest five counties with high private insurance coverage are printed.

#Sort data in ascending order by insurance coverage
insurance_df <- insurance_df %>%
  arrange(ACS_PCT_PRIVATE_ANY)

#View top 5 counties with lowest insurance coverage
head(insurance_df)
## # A tibble: 6 × 2
##   COUNTY               ACS_PCT_PRIVATE_ANY
##   <chr>                              <dbl>
## 1 Oglala Lakota County                12.7
## 2 Todd County                         14.7
## 3 Kusilvak Census Area                15.2
## 4 Adjuntas Municipio                  17.3
## 5 Vieques Municipio                   18.0
## 6 Buffalo County                      18.7

This information provides information regarding the geographic distribution of private health insurance ownership by county, in conjunction with an earlier exploration of the quantitative distribution of the variable. Having a basic understanding of the variable distribution and seeing that they were normally distributed, investigative analysis was performed between our selected demographic variables of interest and health insurance.

Relationships Between Private Health Insurance and Other Variables in the Dataset

1. Various Racial Groups

One of the most important factors that shape private insurance status is an individual’s racial background. We wanted to understand the relationship between private insurance coverage and race because, according to a study conducted by the NIH, “blacks, Hispanics, and some Asian populations when compared with whites, appear to have lower levels of health insurance coverage” (National Research Council (US) Panel on Race, Ethnicity, and Health in Later Life, 2004).

Understanding the relationship between private insurance coverage and race will help shed light on which racial groups have the lowest coverage, and this can lead to the development of policies to bridge the coverage gap. In this dataset, the percentage of each racial group is encoded as the variables “ACS_PCT_WHITE” representing White, “ACS_PCT_BLACK” representing Black, “ACS_PCT_ASIAN” representing Asian, “ACS_PCT_AIAN” representing American Indian and Alaska Native, and finally “ACS_PCT_MULT_RACE” representing multiple races.

Understanding the distribution of race within each county can help provide a baseline insight into the population demographics and context for any interpretation of the analysis. The distribution of racial groups across the county was determined using the pivot_longer() function, which put all the races in one column. We named this new combined variable “sdoh_long.”

Next, we renamed the original race to reflect actual races set by the Census Bureau: White, Black, Asian, American Indian/Alaska Native, and Multiracial. This makes it more comprehensible to our audiences. Then, we created a histogram plot using the ggplot() function. The x-axis represents the percentage of the county’s population belonging to a racial group, while the y-axis shows the number of counties. Within this function, the plotted data’s color, size, and opacity can also be customized. Finally, x and y-axis labels, as well as a title, were added using the xlab(), ylab(), and ggtitle() functions.

#visualization of each racial group in percentage for counties
#Convert data to long format
sdoh_long <- sdoh %>%
  pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
               names_to = "Race_Ethnicity",
               values_to = "Percentage")

#Rename categories for better readability
sdoh_long$Race_Ethnicity <- factor(sdoh_long$Race_Ethnicity, 
                                   levels = c("ACS_PCT_WHITE", "ACS_PCT_BLACK", "ACS_PCT_ASIAN", "ACS_PCT_AIAN", "ACS_PCT_MULT_RACE"),
                                   labels = c("White", "Black", "Asian", "American Indian/Alaska Native", "Multiracial"))

#Create a single faceted histogram plot
ggplot(sdoh_long, aes(x = Percentage)) + 
  geom_histogram(color = "grey", fill = "lightblue", bins = 30) +
  facet_wrap(~Race_Ethnicity, nrow = 2) +  # Arrange plots in a 2-row layout
  xlab("% of County Population") +
  ylab("Number of Counties") +
  ggtitle("Distribution of Racial/Ethnic Groups by County in the US") +
  theme_minimal()

Based on the histogram distribution, it is observed that each race has a skew to its data. The white population has a left skew, showing that most of the data falls around 100% of the population for many counties, which indicates that they are the predominant race in these counties. This is quite the opposite for all the other racial groups, which have a right skew showing their percentage is much closer to 0, which shows they are the minorities and have a lower representation in these counties, indicating a potential disparity in the dataset observations that contextualizes the later research findings within a population predominantly made up of White individuals.

Now that we have a bird’s-eye view of the broad distribution of the data, we wanted to look at more summary statistics to give us more information on each racial group and have an understanding of the distribution of the variable. The next step was using the summary() function to compute the median, mean, minimum, maximum, and first and third quartiles for each racial group. By analyzing these values, we can better understand the demographic makeup of different counties and how it may relate to private insurance coverage. We used the print() and cat() functions to help print all five summary outputs calculated at once.

#Print summary statistics for the distribution of every category of race at once
{
  cat("\n % of Population Identifying as White\n")
  print(summary(sdoh$ACS_PCT_WHITE))

  cat("\n % of Population Identifying as Black\n")
  print(summary(sdoh$ACS_PCT_BLACK))

  cat("\n % of Population Identifying as Asian\n")
  print(summary(sdoh$ACS_PCT_ASIAN))

  cat("\n % of Population Identifying as American Indian/Alaska Native (AIAN)\n")
  print(summary(sdoh$ACS_PCT_AIAN))

  cat("\n % of Population Identifying as Multiracial\n")
  print(summary(sdoh$ACS_PCT_MULT_RACE))
}
## 
##  % of Population Identifying as White
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.29   73.62   87.66   81.20   93.99  100.00 
## 
##  % of Population Identifying as Black
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.690   2.380   9.085  10.210  87.790 
## 
##  % of Population Identifying as Asian
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.630   1.386   1.320  42.600 
## 
##  % of Population Identifying as American Indian/Alaska Native (AIAN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.140   0.320   1.913   0.810  94.540 
## 
##  % of Population Identifying as Multiracial
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.910   2.890   3.647   4.430  40.270

Looking at the summary statistics for each race and the median, it is seen that the most represented race in the data set is white-identifying individuals. Having a median of 87.66% in comparison to the least represented population, American Indian/Alaska Native, having a median of 0.320%. Looking at the % population distribution of White individuals, it is seen that they have a mean lower than the median, which indicates a left skew, which was seen in the histogram above. In addition, the other races all had a mean higher than the median, indicating a right skew. We chose to look specifically at the median because it resists outliers. With this information, we wanted to plot the distribution of each racial group’s private insurance coverage based on the county level.

Next, to integrate these findings with our broader interest in health insurance, we wanted to see the relationship between private insurance coverage and racial groups in the top 5 counties with the highest private insurance coverage rates. To achieve this, we used arrange(desc()) and slice(1:5) to identify these counties. The dataset was then transformed into a long format using pivot_longer(), allowing easier visualization. An interactive bar chart was created using geom_bar() and ggplotly(), displaying the percentage of each racial group in these top counties.

#Selecting the racial variables, county and private insurance
selected_racial_group_with_county <- sdoh %>%
  select(COUNTY, ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE, ACS_PCT_PRIVATE_ANY)

#Identify top 5 counties with highest private insurance coverage
top_counties <- selected_racial_group_with_county %>%
  arrange(desc(ACS_PCT_PRIVATE_ANY)) %>%
  slice(1:5)

top_counties
## # A tibble: 5 × 7
##   COUNTY            ACS_PCT_WHITE ACS_PCT_BLACK ACS_PCT_ASIAN ACS_PCT_AIAN
##   <chr>                     <dbl>         <dbl>         <dbl>        <dbl>
## 1 Kalawao County             13.3          0.69          3.44        17.7 
## 2 Falls Church city          78.0          4.82          9.88         0.09
## 3 Daggett County             97.6          0.17          0            0.34
## 4 Los Alamos County          84.2          0.81          4.92         0.94
## 5 Burke County               93.4          0.75          1.54         1.07
## # ℹ 2 more variables: ACS_PCT_MULT_RACE <dbl>, ACS_PCT_PRIVATE_ANY <dbl>
#Convert race columns into long format for visualization
selected_long <- top_counties %>%
  pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
               names_to = "Race",
               values_to = "Percentage")


#Creating an interactive bar plot
p <- ggplot(selected_long, aes(x = Race, y = Percentage, fill = Race, 
                               text = paste("County:", COUNTY, 
                                            "<br>Private Insurance:", round(ACS_PCT_PRIVATE_ANY, 2)))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ COUNTY, scales = "free_y") +  # Facet by County only
  labs(title = "Race Distribution in Top 5 Counties with Highest Private Insurance",
       x = NULL,  # Removes x-axis label to avoid repetition
       y = "Percentage of Population") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Removes race labels at the bottom
        axis.ticks.x = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),  # Adjust facet title size
        legend.position = "bottom")  # Moves legend to bottom

#Convert to interactive plotly graph
interactive_plot <- ggplotly(p, tooltip = "text")

interactive_plot

The interactive plot follows the pattern established previously about the distribution of each race; the most represented are Whites. However, in Kalawao County, we see that the race with the highest representation is AIAN, which is an interesting departure from the observed trend. Also, compared to the other counties, the other races, like Black, Asian, and Multirace individuals, had a higher representation in Kalawao County. An interactive plot, as opposed to other plots, is used because it allows the audience to toggle, zoom in, and perform many other actions between different elements on the graph.

Lastly, we wanted to get the actual correlation between each racial group and private insurance coverage to provide knowledge of the strength of this relationship. We used the cor() function to calculate the correlation coefficient for each racial group. The cat() function assists in printing all four summary outputs simultaneously. The numbers range from –1 to 1. The closer to one, the higher the correlation of the two variables. If the value is negative, it indicates that as one variable increases, the other decreases. However, if you have a positive value, as one increases, so does the other variable. The racial group with the highest correlation with private insurance having a moderate positive correlation was White, having a value of 0.42.

#Print correlation coefficients for every association between race and private insurance at once 
{
  cat("\nCorrelation: % White vs. Private Insurance\n")
  print(cor(sdoh$ACS_PCT_WHITE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))

  cat("\nCorrelation: % Black vs. Private Insurance\n")
  print(cor(sdoh$ACS_PCT_BLACK, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))

  cat("\nCorrelation: % Asian vs. Private Insurance\n")
  print(cor(sdoh$ACS_PCT_ASIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))

  cat("\nCorrelation: % American Indian/Alaska Native vs. Private Insurance\n")
  print(cor(sdoh$ACS_PCT_AIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))

  cat("\nCorrelation: % Multiracial vs. Private Insurance\n")
  print(cor(sdoh$ACS_PCT_MULT_RACE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
}
## 
## Correlation: % White vs. Private Insurance
## [1] 0.4260394
## 
## Correlation: % Black vs. Private Insurance
## [1] -0.2949309
## 
## Correlation: % Asian vs. Private Insurance
## [1] 0.2303914
## 
## Correlation: % American Indian/Alaska Native vs. Private Insurance
## [1] -0.2617146
## 
## Correlation: % Multiracial vs. Private Insurance
## [1] -0.1810702

Understanding the correlation between racial groups and private insurance will help us improve healthcare equity and access by highlighting whether or not various populations have more or less coverage than others. We predicted race and private insurance coverage having a strong correlation, with Whites having a strong positive correlation and the other races having a strong negative correlation. However, our findings proved otherwise; we saw a moderate positive correlation between White and private insurance coverage and Asians having a low positive correlation (0.23). Blacks and American Indian/Alaska Natives had a negative low correlation (-0.29 and 0.261, respectively). These findings might be because certain racial groups are more likely to have public insurance coverage than others, with White and Asian populations demonstrating the highest correlation with increased health insurance coverage. Consequently, it appears that various populations of color have, unjustly, have less coverage than others, even in counties with the highest private insurance coverage rates.

2: Various Education Levels

Now that we know more about different racial groups and their relationship with private insurance coverage, we wanted to shift our focus to the education level, as educational attainment is another key social determinant of health that shapes life outcomes for many individuals. An interest in investigating this relationship also arose due to a research study finding that increased educational attainment was associated with higher rates of health insurance coverage, particularly in conjunction with the factor of employment (Dewar, 1998). Based on this finding, we predicted that higher educational attainment would also be associated with higher private health insurance rates specifically.

In this particular dataset, the percentage of people in each county with a designated education level is encoded through the following variables: “ACS_PCT_LT_HS” (less than high school education), “ACS_PCT_HS_GRADUATE” (only high school diploma), “ACS_PCT_BACHELOR_DGR” (holds a bachelor’s degree), and “ACS_PCT_GRADUATE_DGR” (holds a graduate or professional degree).

Similar to our interest in looking at the distribution of the “race” variable, we aimed to start with a similar investigation of the “education level” variables to provide insight into what skews might be present in the data that provide context for later interpretations. To visualize the distribution of each educational variable, the ggplot() function used with geom_histogram produces a set of histogram distribution plots that show the distribution of the population percentages within each educational level by county in the dataset. The pivot_longer() function was used to put all the educational levels in one column. We named this new combined variable “sdoh_long,” used consistently throughout the project when creating a new combined variable. Next, we renamed the original variables to educational levels and created a histogram plot using the ggplot() function in order to have clearer titles of the variable names.

#Visualize distribution of education level in US by county
#Convert data to long format to put into a facet wrapped plot 
sdoh_long <- sdoh %>%
  pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
               names_to = "Education_Level",
               values_to = "Percentage")

#Rename education levels for better readability/understanding 
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level, 
                                    levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
                                    labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))

#Create the faceted histogram plot with all four plots in one figure 
ggplot(sdoh_long, aes(x = Percentage)) + 
  geom_histogram(color = "grey", fill = "light blue", bins = 30) +
  facet_wrap(~Education_Level, nrow = 2) +
  xlab("% of county population") +
  ylab("Number of counties") +
  ggtitle("Distribution of Education Levels by County in the US") +
  theme_minimal()

In the distribution for graduate and bachelor’s degrees, there is a right skew to the distribution, which indicates that on the county level, few people have graduate and bachelor’s degrees compared to the other levels of education. The same observation can be seen with people with less than a high school degree. However, the percentage of the county population of high school graduates appears to have a more normal distribution. From this, we can reason that the population we are examining essentially comprises those with educational attainment below a bachelor’s degree. Consequently, this difference in population size across education levels may be the reason for any eventual difference detected between individuals in different educational level groups.

Next, a summary output of basic statistical metrics such as mean and median percentage for each education level group was calculated to visualize the distribution numerically, which provides greater context to our findings from the histogram distributions. The summary() function, in this case, provides a basic sense of the variables’ numeric distribution. The cat() function assists in printing all four summary outputs simultaneously.

#Print all summary statistics for % of population with various education levels at once 
{
  cat("\n % of Population with Less than High School Diploma\n")
  print(summary(sdoh$ACS_PCT_LT_HS))

  cat("\n % of Population with Only High School Diploma\n")
  print(summary(sdoh$ACS_PCT_HS_GRADUATE))

  cat("\n % of Population with a Bachelor's Degree\n")
  print(summary(sdoh$ACS_PCT_BACHELOR_DGR))

  cat("\n % of Population with a Graduate or Professional Degree\n")
  print(summary(sdoh$ACS_PCT_GRADUATE_DGR))
}
## 
##  % of Population with Less than High School Diploma
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.39    7.99   11.28   12.71   16.30   78.15 
## 
##  % of Population with Only High School Diploma
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.52   29.25   34.20   33.84   39.07   54.96 
## 
##  % of Population with a Bachelor's Degree
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.53   13.74   14.68   17.91   43.00 
## 
##  % of Population with a Graduate or Professional Degree
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.98    6.63    7.93    9.54   47.64

This information shows that the mean is above the median for populations with less than a high school diploma, a bachelor’s degree, and a graduate or professional degree, indicating that each distribution is slightly right-skewed. This means counties with high percentages of these population groups are considered outliers compared to those with lower percentages. For those with only a high school diploma, the median is greater than the mean, which indicates a slight left skew and that counties with lower percentages of these population groups are considered outliers. Once again, we show that the population primarily comprises those with only a high school diploma or less, indicating a lower education level among the SDOH respondents. This may have affected the data in a way that may not be seen with more respondents with higher educational attainment.

After visualizing the distribution for each educational level, we were curious about the correlation between these educational levels and private health insurance ownership to understand the relationship between education and health insurance coverage. The following ggplot() functions used with geom_point create correlation plots for the relationship between the education level variables and private insurance ownership. The ggplot() functions are responsible for creating the plot, with the lab (), lab (), and ggtitle() functions being used to describe the ax’s names and titles for the plot.

#Plotting all correlation association plots together on the same plot for visualization 
# Convert data to long format to put all plots together 
sdoh_long <- sdoh %>%
  pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
               names_to = "Education_Level",
               values_to = "Education_Percentage")

# Rename education levels for better readability/understanding 
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level, 
                                    levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
                                    labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))

# Create the faceted scatter plot with trend lines for each relationship
ggplot(sdoh_long, aes(x = Education_Percentage, y = ACS_PCT_PRIVATE_ANY)) +
  geom_point(aes(color = Education_Level), alpha = 0.5, size = 2) + 
  geom_smooth(method = "lm", color = "black", se = FALSE) +  
  facet_wrap(~Education_Level, nrow = 2) +
  scale_color_manual(values = c("green", "blue", "red", "purple")) +  
  xlab("% of Population with Specified Education Level") +
  ylab("% of Population with Private Insurance") +
  ggtitle("Association between Education and Private Insurance Ownership") +
  theme_minimal() +
  theme(legend.position = "none") 
## `geom_smooth()` using formula = 'y ~ x'

Based on the visual depictions seen in the correlation plots, higher percentages of the population comprised of those with less than a high school education and those with only a high school diploma is associated with a decreased percentage of the population owning private health insurance. Higher percentages of the population having a bachelor’s, graduate, or professional degree are associated with an increased percentage of the population owning private health insurance. This indicates that higher education is associated with greater health insurance coverage. Each plot also contains a trendline to depict the general relationship between education and private health insurance: higher education levels are associated with higher private health insurance ownership.

Finally, the next evaluation was to assess the correlation between education and private health insurance ownership through a correlation coefficient value to understand the strength of the relationship beyond just its direction. Using the cor() and cat() functions together also prints the correlation coefficients for each of these relationships.

#Print all correlation coefficients between education level and private insurance rates at once 
{
  cat("\n % of Population with Less than High School Education\n")
  print(cor(sdoh$ACS_PCT_LT_HS, sdoh$ACS_PCT_PRIVATE_ANY))

  cat("\n % of Population with Only High School Diploma\n")
  print(cor(sdoh$ACS_PCT_HS_GRADUATE, sdoh$ACS_PCT_PRIVATE_ANY))

  cat("\n % of Population with a Bachelor's Degree\n")
  print(cor(sdoh$ACS_PCT_BACHELOR_DGR, sdoh$ACS_PCT_PRIVATE_ANY))

  cat("\n % of Population with a Graduate or Professional Degree\n")
  print(cor(sdoh$ACS_PCT_GRADUATE_DGR, sdoh$ACS_PCT_PRIVATE_ANY))
}
## 
##  % of Population with Less than High School Education
## [1] -0.6949437
## 
##  % of Population with Only High School Diploma
## [1] -0.2915207
## 
##  % of Population with a Bachelor's Degree
## [1] 0.5637536
## 
##  % of Population with a Graduate or Professional Degree
## [1] 0.4332939

As shown in the output, the correlation coefficients for the variables, including those with less or only a high school diploma, have negative coefficients (-0.69 and –0.29, respectively). In contrast, those with a bachelor’s, graduate, or professional degree have positive coefficients (0.56 and 0.43 respectively). These findings indicate that our initial prediction of higher education being associated with increased private health insurance ownership rates is potentially correct. This suggests that populations with lower educational attainment may have lower health insurance coverage rates and could benefit from targeted support to increase healthcare access. Further investigation is needed to assess further specifics of this relationship and how to minimize this disparity potentially.

3: Limited English Speaking

This dataset encodes the percentage of limited English-speaking households per county as “ACS_PCT_HH_LIMT_ENGLISH.” The summary() function provides a basic sense of the distribution of this variable, which allows for contextualization of the findings from this analysis within the scope of our population.

#Summary statistics for % of limited English-speaking households
summary(sdoh$ACS_PCT_HH_LIMIT_ENGLISH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.290   0.870   3.469   2.160  90.340

The distribution is right-skewed since the mean (3.5%) is above the median (0.87%). This means that most counties have a relatively low percentage of limited English-speaking households, and counties with a very high percentage are considered outliers. As such, our later findings can be interpreted within this context while leaving space to understand that in a dataset that has a greater representation of counties with a high percentage of limited English-speaking households, results may differ.

To visualize this distribution graphically, which provides the same goal of contextualizing our research findings, we created a histogram using the “ggplot2” package in R. We used the function ggplot(), which takes the dataset and variable of interest as input and plots it along the x-axis. The function geom_histogram specifies the type of graph we want to make: a histogram. Within this function, we specified colors for our graph. We added x and y-axis labels using xlab() and ylab(). Finally, we added a title using ggtitle(). Each function is linked to our original graph using the “+” operator.

#Distribution of percentage of limited English-speaking households 
ggplot(sdoh, aes(ACS_PCT_HH_LIMIT_ENGLISH)) + #selects variable to put on x axis
  geom_histogram(color = 'grey', fill = 'light blue')+ #sets graph color 
  xlab("% of limited English-speaking households") + #adds x axis label
  ylab("number of counties")+ #adds y axis label 
  ggtitle("Distribution of US Counties by Percentage of Limited English-Speaking Households")+ theme_minimal() #sets graph aesthetics built into ggplot package
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This graph depicts the right-skewed distribution of this variable, as demonstrated by the summary statistics. The percentage of limited English-speaking households is below 25% for most counties in this dataset, indicating that English-speaking proficiency is generally low in the SDOH population.

Next, two methods were used to evaluate the relationship between limited English-speaking and health insurance coverage: a visual method (by graph) and a quantitative method (by calculating the correlation coefficient). For the graph, we used the function geom_point(), which creates a scatter plot. Within this function, the plotted data’s color, size, and opacity were also customized. Finally, x and y-axis labels, as well as a title, were added using the xlab(), ylab(), and ggtitle() functions. The purpose of this plot is to understand the direction of the relationship between English proficiency and private health insurance coverage.

#Correlation between limited English-speaking and private insurance coverage
ggplot(sdoh, aes(x = ACS_PCT_HH_LIMIT_ENGLISH, y = ACS_PCT_PRIVATE_ANY)) +
  geom_point(color = 'blue',alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  xlab("% of limited English-speaking households") +
  ylab("% of population with private insurance")+
  ggtitle("Association between English-speaking and private insurance ownership")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This graph illustrates a negative correlation between the percentage of individuals from limited English-speaking households and those with private insurance coverage by county. In other words, this indicates that as the number of limited English-speaking households increases, the percentage of the county with private health insurance decreases immensely, indicating a potential inequity regarding health insurance access among populations with limited English profieincy.

The last piece of analysis for evaluating the relationship between limited English-speaking and insurance coverage was calculating the correlation coefficient to assess the relationship’s strength. This provides a quantitative measure of the relationship illustrated on the scatterplot. R has a built-in function called “cor()” that automatically calculates the degree of association between two variables.

##Calculate correlation coefficient 
cor(sdoh$ACS_PCT_HH_LIMIT_ENGLISH, sdoh$ACS_PCT_PRIVATE_ANY)
## [1] -0.3865378

The correlation coefficient was calculated as –0.39, which indicates a moderate negative association between limited English-speaking and insurance coverage. In other words, as the percentage of limited English-speaking households in a county increases, the percentage of residents with healthcare insurance coverage decreases.

The association between the percentage of individuals from limited English-speaking households and those with private insurance coverage reflected in the graph and by the correlation coefficient is consistent with our prediction. Further investigation is necessary to understand the causal relationship between these variables and how this disparity in health insurance might negatively affect healthcare outcomes for limited English-speaking populations.

Conclusion

Overall, the key findings of this exploratory analysis include that populations with higher levels of limited English-speaking households, those with less than or only a high school diploma for education, and higher percentages of White individuals are associated with higher levels of private health insurance ownership. Our analysis of correlations between private insurance and each determinant category shows that various marginalized populations may face issues with access to health insurance and, subsequently, access to healthcare.

Working with these findings, several initial questions arise that could guide future research. For instance, the relation between these health outcomes and having limited English-speaking households, having less than or only a high school diploma for education, or being a person of color is largely unknown. Moreover, the mechanism that may restrict or create barriers for members of these populations to access healthcare insurance is also underexplored. In addition, it may be interesting to investigate further the intersectionality of each category and how that will affect the correlation with private insurance if an individual falls between more than one variable. Future investigations that aim to answer these questions provide more depth to the findings from this analysis, ultimately highlighting clearer paths to target to increase health insurance access for all people equitably.

Additionally, further investigation could address limitations of this work, including reshaping the distributions to be normally distributed for each variable. Investigating these relationships and determining more information would enable us to further our knowledge about ways we can use this information to improve healthcare access and equity for groups who are marginalized and are typically underrepresented populations of private health insurance owners.

Works Cited

Baicker K., Taubman S.L., Allen H.L., Bernstein M., Gruber J.H., Newhouse J.P., Schneider E.C., Wright B.J., Zaslavsky A.M., Finkelstein A.N. for the Oregon Health Study Group (2013, May 2). The Oregon experiment–effects of Medicaid on clinical outcomes. New England Journal of Medicine, 368(18), 1713–1722. https://pmc.ncbi.nlm.nih.gov/articles/PMC3701298/

Dewar, M. D. (1998). Do those with more formal education have better health insurance opportunities? Economics of Education Review, 17(3), 267-277. https://doi.org/10.1016/S0272-7757(97)00034-4

National Research Council (US) Panel on Race, Ethnicity, and Health in Later Life; Bulatao, R. A., & Anderson, N. B. (Eds.). (2004). Race, ethnicity, and health in later life. Washington, DC: National Academies Press (US).

Ramirez, N., Shi, K., Yabroff, K. R., Han, X., Fedewa, S. A., & Nogueira, L. M. (2023). Access to care among adults with limited English proficiency. Journal of General Internal Medicine, 38(3), 592–599. https://doi.org/10.1007/s11606-022-07690-3

Agency for Healthcare Research and Quality. (2023, June). Social Determinants of Health Database. Rockville, MD. https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html